fix(validator): fail closed on unreadable PAT store instead of wiping it#1486
Open
anderdc wants to merge 1 commit into
Open
fix(validator): fail closed on unreadable PAT store instead of wiping it#1486anderdc wants to merge 1 commit into
anderdc wants to merge 1 commit into
Conversation
A single failed read of miner_pats.json (corrupt file or transient I/O) made _read_file return [], and the next save_pat upserted into that [] and atomically overwrote the store — erasing every other miner's stored PAT, who were then silently scored 0 until each re-posted. _read_file now raises on a corrupt/unreadable file (returns [] only when the file genuinely does not exist). save_pat/get_pat_by_uid propagate, so the write path fails closed and the broadcast handler rejects (miner retries) instead of overwriting. load_all_pats stays tolerant so a transient read failure cannot crash the scoring round. Fixes #1481
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1481
What this fixes
The root cause in #1481 (Proof 2): a single failed read of
miner_pats.jsonsilently and permanently erases every stored PAT._read_file()caught(json.JSONDecodeError, OSError)and returned[], so it could not tell "the store is empty" from "the read just failed" (a corrupt file, a transient I/O blip, a partial write). The nextsave_pat()read that[], upserted the one incoming PAT, and atomically overwrote the file — wiping every other miner's stored PAT. Those miners are then scored 0 by that validator every round (No stored PAT for miner {uid},inspections.py:134) until each manually re-posts, with no error and no signal.The change (validator-side, ~a handful of lines)
_read_file()fails loud — returns[]only when the file genuinely does not exist; raises on a corrupt/unreadable file.save_pat()/get_pat_by_uid()fail closed — they propagate that error, so the write path can no longer mistake a failed read for an empty store and overwrite it. A failed read raises before any write; the existing store is left intact.handle_pat_broadcastrejects gracefully — catches the store-read error and returns a clear"PAT store temporarily unavailable; please retry"rejection, so the miner retries instead of the axon surfacing a raw error.load_all_pats()stays tolerant — it is the read-only scoring snapshot; an unhandled error there would stop the validator (the run loop'sexceptsits outsidewhile True) or zero everyone. It now logs loudly and returns[], so a transient read failure recovers on the next round and never wipes anything.Why this is the appropriate fix
The store loss is the actual bug, and stopping the wipe protects every miner on every validator with no miner action — it doesn't ask honest miners to run a re-broadcast daemon to paper over validator-side data loss. It applies the same principle already adopted in #931 / #932 / #1107: a transient condition must not produce a permanent wrong outcome. Failing closed and loud also surfaces a genuinely corrupt store to the operator instead of silently degrading it down to a single entry.
Tests
tests/validator/test_pat_storage.py(#1481 Proof 2 → regression coverage):save_patraise and leaves the on-disk store fully intact (all prior entries survive; the incoming UID is never written);tests/validator/test_pat_handler.py:handle_pat_broadcastreject (accepted=False,"temporarily unavailable") rather than wiping the store or raising.Follow-up (not in this PR)
This closes the code path that lets a well-configured validator wipe itself. It does not cover PAT loss from a validator restarting without a persistent
./datavolume, or stuck in a crash loop — those are operator config, not code. The follow-up is to reach out to the validators currently missing coverage (thevali-116/rt21-64cases in #1481) to confirm they mount./data:/app/dataand aren't crash-looping.